Algorithmic Complexity Bounds 
on Future Prediction Errors* 



Alexey Chernov 1,4   Marcus Hutter 1,3   Jürgen Schmidhuber 1,2

1 IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland
2 TU Munich, Boltzmannstr. 3, 85748 Garching, München, Germany
3 RSISE/ANU/NICTA, Canberra, ACT, 0200, Australia
4 LIF, CMI, 39 rue Joliot Curie, 13453 Marseille cedex 13, France
{alexey,marcus,juergen}@idsia.ch, http://www.idsia.ch/~{alexey,marcus,juergen}

19 January 2007 



Abstract 

We bound the future loss when predicting any (computably) stochastic sequence online. Solomonoff finitely bounded the total deviation of his universal predictor $M$ from the true distribution $\mu$ by the algorithmic complexity of $\mu$. Here we assume that we are at a time $t > 1$ and have already observed $x = x_1...x_t$. We bound the future prediction performance on $x_{t+1}x_{t+2}...$ by a new variant of algorithmic complexity of $\mu$ given $x$, plus the complexity of the randomness deficiency of $x$. The new complexity is monotone in its condition in the sense that this complexity can only decrease if the condition is prolonged. We also briefly discuss potential generalizations to Bayesian model classes and to classification problems.

1 Introduction



We consider the problem of online=sequential predictions. We assume that the sequences $x = x_1x_2x_3...$ are drawn from some "true" but unknown probability distribution $\mu$. Bayesians proceed by considering a class $\mathcal{M}$ of models=hypotheses=distributions, sufficiently large such that $\mu \in \mathcal{M}$, and a prior over $\mathcal{M}$. Solomonoff considered the truly large class that contains all computable probability distributions [Sol64]. He showed that his universal distribution $M$ converges rapidly to $\mu$ [Sol78], i.e. predicts well in any environment as long as it is computable or can be modeled by a computable probability distribution (all physical theories are of this sort). $M(x)$ is roughly $2^{-K(x)}$, where $K(x)$ is the length of the shortest description of $x$, called the Kolmogorov complexity of $x$. Since $K$ and $M$ are incomputable, they have to be approximated in practice. See e.g. [Sch02b, Hut05, LV97, CV05] and references therein. The universality of $M$ also precludes useful statements about the prediction quality at particular time instances $n$ [Hut05, p. 62], as opposed to simple classes like i.i.d. sequences (data) of size $n$, where accuracy is typically $O(n^{-1/2})$. Luckily, bounds on the expected total=cumulative loss (e.g. number of prediction errors) for $M$ can be derived [Sol78, Hut01c, Hut03a, Hut03b], which is often sufficient in an online setting. The bounds are in terms of the (Kolmogorov) complexity of $\mu$. For instance, for deterministic $\mu$, the number of errors is (in a sense tightly) bounded by $K(\mu)$, which measures in this case the information (in bits) in the observed infinite sequence $x$.

What's new. In this paper we assume we are at a time $t > 1$ and have already observed $x = x_1...x_t$. Hence we are interested in the future prediction performance on $x_{t+1}x_{t+2}...$, since typically we don't care about past errors. If the total loss is finite, the future loss must necessarily be small for large $t$. In a sense the paper intends to quantify this apparent triviality. If the complexity of $\mu$ bounds the total loss, a natural guess is that something like the conditional complexity of $\mu$ given $x$ bounds the future loss. (If $x$ contains a lot of (or even all) information about $\mu$, we should make fewer (or no) further errors.) Indeed, we prove two bounds of this kind but with additional terms describing structural properties of $x$. These additional terms appear since the total loss is bounded only in expectation, and hence the future loss is small only for "most" $x_1...x_t$. In the first bound (Theorem 1), the additional term is the complexity of the length of $x$ (a kind of worst-case estimation). The second bound (Theorem 7) is finer: the additional term is the complexity of the randomness deficiency of $x$. The advantage is that the deficiency is small for "typical" $x$ and bounded on average (in contrast to the length). But in this case the conventional conditional complexity turned out to be unsuitable. So we introduce a new natural modification of conditional Kolmogorov complexity, which is monotone as a function of the condition. Informally speaking, we require programs (=descriptions) to be consistent in the sense that if a program generates some $\mu$ given $x$, then it must generate the same $\mu$ given any prolongation of $x$. The new posterior bounds also significantly improve upon the previous total bounds.

Contents. The paper is organized as follows. Some basic notation and definitions are given in Sections 2 and 3. In Section 4 we prove and discuss the length-based bound Theorem 1. In Section 5 we show why a new definition of complexity is necessary and formulate the deficiency-based bound Theorem 7. We discuss the definition and basic properties of the new complexity in Section 6, and prove Theorem 7 in Section 7. We briefly discuss potential generalizations to general model classes $\mathcal{M}$ and classification in the concluding Section 8.



2 Notation & Definitions 

We essentially follow the notation of [LV97, Hut05].

Strings and natural numbers. We write $\mathcal{X}^*$ for the set of finite strings over a finite alphabet $\mathcal{X}$, and $\mathcal{X}^\infty$ for the set of infinite sequences. The cardinality of a set $\mathcal{S}$ is denoted by $|\mathcal{S}|$. We use letters $i,k,l,n,t$ for natural numbers, $u,v,x,y,z$ for finite strings, $\epsilon$ for the empty string, and $\alpha = \alpha_{1:\infty}$ etc. for infinite sequences. For a string $x$ of length $\ell(x) = n$ we write $x_1x_2...x_n$ with $x_t \in \mathcal{X}$ and further abbreviate $x_{k:n} := x_kx_{k+1}...x_{n-1}x_n$ and $x_{<n} := x_1...x_{n-1}$. For $x_t \in \mathcal{X}$, denote by $\bar x_t$ an (arbitrary) element from $\mathcal{X}$ such that $\bar x_t \neq x_t$. For binary alphabet $\mathcal{X} = \{0,1\}$, the $\bar x_t$ is uniquely defined. We occasionally identify strings with natural numbers.

Prefix sets. A string $x$ is called a (proper) prefix of $y$ if there is a $z$ ($\neq \epsilon$) such that $xz = y$; $y$ is called a prolongation of $x$. We write $x* = y$ in this case, where $*$ is a wildcard for a string, and similarly for the case where $y$ is an infinite sequence. A set of strings is called prefix free if no element is a proper prefix of another. Any prefix-free set $\mathcal{P}$ has the important property of satisfying Kraft's inequality

$$\sum_{x \in \mathcal{P}} |\mathcal{X}|^{-\ell(x)} \leq 1.$$
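As a small illustration of the prefix-free property and Kraft's inequality, the following sketch checks both for finite sets of strings. The code sets `P` and `Q` and the helper names are hypothetical examples chosen here, not objects from the paper.

```python
from fractions import Fraction
from itertools import combinations


def is_prefix_free(codes):
    """True if no string in `codes` is a proper prefix of another."""
    return not any(
        a != b and (a.startswith(b) or b.startswith(a))
        for a, b in combinations(codes, 2)
    )


def kraft_sum(codes, alphabet_size=2):
    """Exact value of the Kraft sum: sum over x in codes of |X|^(-len(x))."""
    return sum(Fraction(1, alphabet_size ** len(c)) for c in codes)


# Hypothetical example sets over the binary alphabet.
P = ["0", "10", "110", "111"]   # prefix free
Q = ["0", "01", "11"]           # not prefix free: "0" is a prefix of "01"

print(is_prefix_free(P), kraft_sum(P))
print(is_prefix_free(Q), kraft_sum(Q))
```

Exact rational arithmetic is used so that the comparison with 1 is not affected by floating-point rounding. Note that the inequality is only guaranteed for prefix-free sets; a set that is not prefix free may or may not satisfy it.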

Asymptotic notation. We write $f(x) \stackrel{\times}{\leq} g(x)$ for $f(x) = O(g(x))$ and $f(x) \stackrel{+}{\leq} g(x)$ for $f(x) \leq g(x) + O(1)$. Equalities $\stackrel{\times}{=}$, $\stackrel{+}{=}$ are defined similarly: they hold if the corresponding inequalities hold in both directions.

(Semi)measures. We call $\rho: \mathcal{X}^* \to [0,1]$ a semimeasure iff $\sum_{x_n \in \mathcal{X}} \rho(x_{1:n}) \leq \rho(x_{<n})$ and $\rho(\epsilon) \leq 1$, and a measure iff both non-strict inequalities are equalities. $\rho(x)$ is interpreted as the $\rho$-probability of sampling a sequence which starts with $x$. The conditional probability (posterior)

$$\rho(y|x) := \frac{\rho(xy)}{\rho(x)} \qquad (1)$$

is the $\rho$-probability that a string $x$ is followed by (continued with) $y$. If $\rho(x) = 0$, $\rho(y|x)$ is defined arbitrarily and every such function is called a version of conditional probability. We call $\rho$ deterministic if $\exists \alpha : \rho(\alpha_{1:n}) = 1 \ \forall n$. In this case we identify $\rho$ with $\alpha$.
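To make the definitions concrete, here is a minimal sketch of a measure on binary strings together with its conditional probability. The Bernoulli family, the parameter value and the function names are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction
from itertools import product


def bernoulli(theta):
    """Return rho with rho(x) = theta^(#1s in x) * (1 - theta)^(#0s in x)."""
    def rho(x):
        ones = x.count("1")
        return theta ** ones * (1 - theta) ** (len(x) - ones)
    return rho


def conditional(rho, y, x):
    """rho(y|x) = rho(xy) / rho(x); returns None where rho(x) = 0,
    since the definition leaves that case arbitrary."""
    return rho(x + y) / rho(x) if rho(x) != 0 else None


def is_measure(rho, max_len=4):
    """Check rho(empty) = 1 and rho(x0) + rho(x1) = rho(x) for all short x."""
    if rho("") != 1:
        return False
    return all(
        rho(x + "0") + rho(x + "1") == rho(x)
        for n in range(max_len)
        for x in ("".join(bits) for bits in product("01", repeat=n))
    )


rho = bernoulli(Fraction(1, 3))        # hypothetical parameter
print(is_measure(rho))
print(conditional(rho, "1", "00"))     # probability that "00" is followed by "1"
```

A semimeasure would only need the two checks in `is_measure` to hold with "less than or equal" instead of equality.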

Random events and expectations. We assume that the sequence $\omega = \omega_{1:\infty}$ is sampled from the "true" measure $\mu$, i.e. $\mathbf{P}[\omega_{1:n} = x_{1:n}] = \mu(x_{1:n})$. We denote expectations w.r.t. $\mu$ by $\mathbf{E}$, i.e. for a function $f: \mathcal{X}^n \to \mathbb{R}$, $\mathbf{E}[f] = \mathbf{E}[f(\omega_{1:n})] = \sum_{x_{1:n}} \mu(x_{1:n}) f(x_{1:n})$. We abbreviate $\mu_t := \mu(\cdot|\omega_{<t})$.

Enumerable sets and functions. A set of strings (or naturals, or other constructive objects) is called enumerable if it is the range of some computable function. A function $f: \mathcal{X}^* \to \mathbb{R}$ is called enumerable (respectively co-enumerable) if the set of pairs $\{(x, \frac{k}{n}) \mid f(x) > \frac{k}{n}\}$ (respectively $\{(x, \frac{k}{n}) \mid f(x) < \frac{k}{n}\}$) is enumerable. A measure $\mu$ is called computable if it is enumerable and co-enumerable and the set $\{x \mid \mu(x) = 0\}$ is decidable (i.e. enumerable and co-enumerable).

To simplify the statements of the theorems below, we assume that for every computable measure $\mu$, there is one fixed computable version of conditional probability $\mu(y|x)$, for example, $\mu(y|x)$ is the uniform measure on $y$'s for $\mu(x) = 0$.

Prefix Kolmogorov complexity. The conditional prefix complexity $K(y|x) := \min\{\ell(p) : U(p,x) = y\}$ is the length of the shortest binary (self-delimiting) program $p \in \{0,1\}^*$ on a universal prefix Turing machine $U$ with output $y \in \mathcal{X}^*$ and input $x \in \mathcal{X}^*$ [LV97]. $K(x) := K(x|\epsilon)$. For non-string objects $o$ we define $K(o) := K(\langle o \rangle)$, where $\langle o \rangle \in \mathcal{X}^*$ is some standard code for $o$. In particular, if $(f_i)_{i=1}^\infty$ is an enumeration of all (co-)enumerable functions, we define $K(f_i) := K(i)$. We need the following properties: the co-enumerability of $K$, the upper bounds $K(x|\ell(x)) \stackrel{+}{\leq} \ell(x)\log_2|\mathcal{X}|$ and $K(n) \stackrel{+}{\leq} 2\log_2 n$, Kraft's inequality $\sum_x 2^{-K(x)} \leq 1$, the lower bound $K(x) \geq \ell(x)$ for "most" $x$ and $K(n) \to \infty$ for $n \to \infty$, extra information bounds $K(x|y) \stackrel{+}{\leq} K(x) \stackrel{+}{\leq} K(x,y)$, subadditivity $K(xy) \stackrel{+}{\leq} K(x,y) \stackrel{+}{\leq} K(y) + K(x|y)$, information non-increase $K(f(x)) \stackrel{+}{\leq} K(x) + K(f)$ for computable $f: \mathcal{X}^* \to \mathcal{X}^*$, and coding relative to a probability distribution (MDL): if $P: \mathcal{X}^* \to [0,1]$ is enumerable and $\sum_x P(x) \leq 1$, then $K(x) \stackrel{+}{\leq} -\log_2 P(x) + K(P)$.

Monotone and Solomonoff complexity. The monotone complexity $Km(x) := \min\{\ell(p) : U(p) = x*\}$ is the length of the shortest binary (possibly non-halting) program $p \in \{0,1\}^*$ on a universal monotone Turing machine $U$ which outputs a string starting with $x$. Solomonoff's prior $M(x) := \sum_{p: U(p) = x*} 2^{-\ell(p)} =: 2^{-KM(x)}$ is the probability that $U$ outputs a string starting with $x$ if provided with fair coin flips on the input tape. Most complexities coincide within an additive term $O(\log \ell(x))$, e.g. $K(x|\ell(x)) \stackrel{+}{\leq} KM(x) \stackrel{+}{\leq} Km(x) \stackrel{+}{\leq} K(x)$; hence similar relations as for $K$ hold.

3 Setup 

Convergent predictors. We assume that $\mu$ is a "true"^1 sequence generating measure, also called an environment. If we know the generating process $\mu$, and given past data $x_{<t}$, we can predict the probability $\mu(x_t|x_{<t})$ of the next data item $x_t$. Usually we do not know $\mu$, but estimate it from $x_{<t}$. Let $\rho(x_t|x_{<t})$ be an estimated probability^2 of $x_t$, given $x_{<t}$. Closeness of $\rho(x_t|x_{<t})$ to $\mu(x_t|x_{<t})$ is desirable as a goal in itself or when performing a Bayes decision $y_t$ that has minimal $\rho$-expected loss $l_t^\rho(x_{<t}) := \min_{y_t} \sum_{x_t} \mathrm{Loss}(x_t,y_t)\,\rho(x_t|x_{<t})$. Consider, for instance, a weather data sequence $x_{1:n}$ with $x_t = 1$ meaning rain and $x_t = 0$ meaning sun at day $t$. Given $x_{<t}$ the probability of rain tomorrow is $\mu(1|x_{<t})$. A weather forecaster may announce the probability of rain to be $y_t := \rho(1|x_{<t})$, which should be close to the true probability $\mu(1|x_{<t})$. To aim for

$$\rho(x_t|x_{<t}) - \mu(x_t|x_{<t}) \to 0 \quad \text{as} \quad t \to \infty$$

seems reasonable.

^1 Also called objective or aleatory probability or chance.
^2 Also called subjective or belief or epistemic probability.

Convergence in mean sum. We can quantify the deviation of $\rho_t$ from $\mu_t$, e.g. by the squared difference

$$s_t(\omega_{<t}) := \sum_{x_t \in \mathcal{X}} \big(\rho(x_t|\omega_{<t}) - \mu(x_t|\omega_{<t})\big)^2 = \sum_{x_t} (\rho_t - \mu_t)^2.$$

Alternatively one may also use the squared absolute distance $s_t := \frac{1}{2}\big(\sum_{x_t} |\rho_t - \mu_t|\big)^2$, the Hellinger distance $s_t := \sum_{x_t} (\sqrt{\rho_t} - \sqrt{\mu_t})^2$, the KL-divergence $s_t := \sum_{x_t} \mu_t \ln\frac{\mu_t}{\rho_t}$, or the squared Bayes regret $s_t := \frac{1}{2}(l_t^\rho - l_t^\mu)^2$ for $l_t \in [0,1]$. For all these distances one can show [Hut01b, Hut03a, Hut05] that their cumulative expectation from $l$ to $n$ is bounded as follows:

$$\sum_{t=l}^{n} \mathbf{E}[s_t|\omega_{<l}] \leq \mathbf{E}\Big[\ln\frac{\mu(\omega_{l:n}|\omega_{<l})}{\rho(\omega_{l:n}|\omega_{<l})} \Big| \omega_{<l}\Big] =: D_{l:n}(\omega_{<l}). \qquad (2)$$

$D_{l:n}$ is increasing in $n$, hence $D_{l:\infty} \in [0,\infty]$ exists [Hut01a, Hut05]. A sequence of random variables like $s_t$ is said to converge to zero with probability 1 if the set $\{\omega : s_t(\omega) \to 0 \text{ as } t \to \infty\}$ has measure 1. $s_t$ is said to converge to zero in mean sum if $\sum_{t=1}^\infty \mathbf{E}[|s_t|] \leq c < \infty$, which implies convergence with probability 1 (rapid if $c$ is of reasonable size). Therefore a small finite bound on $D_{1:\infty}$ would imply rapid convergence of the $s_t$ defined above to zero, hence $\rho_t \to \mu_t$ and $l_t^\rho \to l_t^\mu$ fast. So the crucial quantities to consider and bound (in expectation) are $\ln\frac{\mu(x)}{\rho(x)}$ if $l = 1$ and $\ln\frac{\mu(y|x)}{\rho(y|x)}$ for $l > 1$. For illustration we will sometimes loosely interpret $D_{1:\infty}$ and other quantities as the number of prediction errors, as for the error-loss they are closely related to it [Hut01c, Hut01a].
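The instantaneous distances listed above can be computed directly for a single prediction step. The sketch below does so for a hypothetical three-symbol example; the two probability vectors and the function name are assumptions made for illustration, and the Bayes regret is omitted because it needs a loss function.

```python
import math


def distances(rho, mu):
    """One-step distances between a predictive distribution rho and the true mu.

    Both arguments are sequences of probabilities over the same alphabet.
    """
    pairs = list(zip(rho, mu))
    squared = sum((r - m) ** 2 for r, m in pairs)
    absolute = 0.5 * sum(abs(r - m) for r, m in pairs) ** 2
    hellinger = sum((math.sqrt(r) - math.sqrt(m)) ** 2 for r, m in pairs)
    # KL-divergence of mu from rho; infinite if rho misses a possible symbol.
    kl = sum(
        (m * math.log(m / r) if r > 0 else math.inf)
        for r, m in pairs
        if m > 0
    )
    return {"squared": squared, "absolute": absolute,
            "hellinger": hellinger, "kl": kl}


# Hypothetical predictive and true distributions over three symbols.
d = distances(rho=[0.5, 0.3, 0.2], mu=[0.7, 0.2, 0.1])
for name, value in d.items():
    print(f"{name:10s} {value:.6f}")
```

In this particular example each of the other distances is no larger than the KL-divergence, which is in line with the fact that the bound (2) is stated in terms of the expected log-ratio.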

Bayes mixtures. A Bayesian considers a class of distributions $\mathcal{M} := \{\nu_1, \nu_2, ...\}$, large enough to contain $\mu$, and uses the Bayes mixture

$$\xi(x) := \sum_{\nu \in \mathcal{M}} w_\nu \cdot \nu(x), \qquad \sum_{\nu \in \mathcal{M}} w_\nu = 1, \qquad w_\nu > 0 \qquad (3)$$

for prediction, where $w_\nu$ can be interpreted as the prior of (or initial belief in) $\nu$. The dominance

$$\xi(x) \geq w_\mu \cdot \mu(x) \quad \forall x \in \mathcal{X}^* \qquad (4)$$

is its most important property. Using $\rho = \xi$ for prediction, this implies $D_{1:\infty} \leq \ln w_\mu^{-1} < \infty$, hence $\xi_t \to \mu_t$. If $\mathcal{M}$ is chosen sufficiently large, then $\mu \in \mathcal{M}$ is not a serious constraint.
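The dominance property and the resulting bound on the total deviation can be checked numerically on a small model class. The following sketch uses a hypothetical class of three Bernoulli measures with hand-picked prior weights (both are assumptions for illustration) and computes $D_{1:n}$ exactly by summing over all binary strings of length $n$.

```python
import math
from itertools import product

# Hypothetical finite model class: Bernoulli(theta) measures on binary strings.
THETAS = [0.2, 0.5, 0.8]
WEIGHTS = [0.25, 0.5, 0.25]   # prior weights w_nu: positive, summing to 1


def nu(theta, x):
    """Probability that a Bernoulli(theta) sequence starts with the bit string x."""
    ones = x.count("1")
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)


def xi(x):
    """Bayes mixture xi(x) = sum over nu of w_nu * nu(x), as in (3)."""
    return sum(w * nu(th, x) for w, th in zip(WEIGHTS, THETAS))


def total_deviation(mu_index, n):
    """D_{1:n} = E[ln(mu(omega_{1:n}) / xi(omega_{1:n}))] under the true mu."""
    theta = THETAS[mu_index]
    total = 0.0
    for bits in product("01", repeat=n):
        x = "".join(bits)
        p = nu(theta, x)
        total += p * math.log(p / xi(x))
    return total


for n in (1, 4, 8):
    print(n, total_deviation(0, n), "bound:", math.log(1.0 / WEIGHTS[0]))
```

The exhaustive sum is exponential in $n$, so this is only meant as a check for short horizons, not as a practical way to evaluate the bound.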

Solomonoff prior. So we consider the largest (from a computational point of view) relevant class, the class $\mathcal{M}_U$ of all enumerable semimeasures (which includes all computable probability distributions) and choose $w_\nu = 2^{-K(\nu)}$, which is biased towards simple environments (Occam's razor). This gives us Solomonoff-Levin's prior $M$ [Sol64, ZL70] (this definition coincides within an irrelevant multiplicative constant with the one in Section 2). In the following we assume $\mathcal{M} = \mathcal{M}_U$, $\rho = \xi = M$, $w_\nu = 2^{-K(\nu)}$ and $\mu \in \mathcal{M}_U$ being a computable (proper) measure, hence $M(x) \geq 2^{-K(\mu)}\mu(x) \ \forall x$ by (4).

Prediction of deterministic environments. Consider a computable sequence $\alpha = \alpha_{1:\infty}$ "sampled from $\mu \in \mathcal{M}$" with $\mu(\alpha) = 1$, i.e. $\mu$ is deterministic. Then from (4) we get

$$\sum_{t=1}^{\infty} |1 - M(\alpha_t|\alpha_{<t})| \leq -\sum_{t=1}^{\infty} \ln M(\alpha_t|\alpha_{<t}) = -\ln M(\alpha_{1:\infty}) \leq K(\mu)\ln 2 < \infty, \qquad (5)$$

which implies that $M(\alpha_t|\alpha_{<t})$ converges rapidly to 1 and hence $M(\bar\alpha_t|\alpha_{<t}) \to 0$, i.e. asymptotically $M$ correctly predicts the next symbol. The number of prediction errors is of the order of the complexity $K(\mu) \stackrel{+}{=} Km(\alpha)$ of the sequence.

For binary alphabet this is the best we can expect, since at each time-step only a single bit can be learned about the environment, and only after we "know" the environment we can predict correctly. For non-binary alphabet, $K(\mu)$ still measures the information in $\mu$ in bits, but feedback per step can now be $\log_2|\mathcal{X}|$ bits, so we may expect a better bound $K(\mu)/\log_2|\mathcal{X}|$. But in the worst case all $\alpha_t \in \{0,1\} \subseteq \mathcal{X}$. So without structural assumptions on $\mu$ the bound cannot be improved even if $\mathcal{X}$ is huge. We will see how our posterior bounds can help in this situation.
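The chain of inequalities for deterministic environments can be illustrated with a computable stand-in for $M$. The sketch below replaces the incomputable universal prior by a uniform Bayes mixture over a small hypothetical class of periodic binary sequences; the pattern list and the uniform weights are assumptions made for illustration, so the bound that appears is $\ln w_\alpha^{-1}$ rather than $K(\mu)\ln 2$.

```python
import math

# Hypothetical class of deterministic environments: each pattern repeated forever.
PATTERNS = ["0", "1", "01", "10", "001", "011"]
WEIGHT = 1.0 / len(PATTERNS)          # uniform prior weight for every environment


def prefix(pattern, n):
    """First n symbols of the infinite sequence pattern pattern pattern ..."""
    reps = n // len(pattern) + 1
    return (pattern * reps)[:n]


def xi(x):
    """Mixture probability of x: total weight of environments consistent with x."""
    return sum(WEIGHT for p in PATTERNS if prefix(p, len(x)) == x)


def cumulative_error(true_pattern, n):
    """Sum over t = 1..n of |1 - xi(alpha_t | alpha_<t)| along the true sequence."""
    total = 0.0
    for t in range(1, n + 1):
        a = prefix(true_pattern, t)
        total += 1.0 - xi(a) / xi(a[:-1])
    return total


alpha = "011"
n = 30
print(cumulative_error(alpha, n), "<=", -math.log(xi(prefix(alpha, n))),
      "<=", math.log(len(PATTERNS)))
```

Once the observed prefix rules out every other environment in the class, the conditional probability of the true next symbol is 1 and no further error mass accumulates.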

Individual randomness (deficiency). Let us now consider a general (not necessarily deterministic) computable measure $\mu \in \mathcal{M}$. The Shannon-Fano code of $x$ w.r.t. $\mu$ has code-length $\lceil -\log_2\mu(x) \rceil$, which is "optimal" for "typical/random" $x$ sampled from $\mu$. Further, $-\log_2 M(x) \approx K(x)$ is the length of an "optimal" code for $x$. Hence $-\log_2\mu(x) \approx -\log_2 M(x)$ for "$\mu$-typical/random" $x$. This motivates the definition of $\mu$-randomness deficiency

$$d_\mu(x) := \log_2\frac{M(x)}{\mu(x)},$$

which is small for "typical/random" $x$. Formally, a sequence $\alpha$ is called (Martin-Löf) random iff $d_\mu(\alpha) := \sup_n d_\mu(\alpha_{1:n}) < \infty$, i.e. iff its Shannon-Fano code is "optimal" (note that $d_\mu(\alpha) \geq -K(\mu) > -\infty$ for all sequences), i.e. iff

$$\sup_n \Big| \sum_{t=1}^{n} \log\frac{\mu(\alpha_t|\alpha_{<t})}{M(\alpha_t|\alpha_{<t})} \Big| = \sup_n \Big| \log\frac{\mu(\alpha_{1:n})}{M(\alpha_{1:n})} \Big| < \infty.$$

Unfortunately this does not imply $M_t \to \mu_t$ on the $\mu$-random $\alpha$, since $M_t$ may oscillate around $\mu_t$, which indeed can happen [HM04]. But if we take the expectation, Solomonoff [Sol78, Hut01a, Hut05] showed

$$0 \leq \sum_{t=1}^{\infty} \mathbf{E} \sum_{x_t} (M_t - \mu_t)^2 \leq D_{1:\infty} = \lim_{n\to\infty} \mathbf{E}[-d_\mu(\omega_{1:n})]\ln 2 \leq K(\mu)\ln 2 < \infty, \qquad (6)$$

hence $M_t \to \mu_t$ with $\mu$-probability 1. So in any case, $d_\mu(x)$ is an important quantity, since the smaller $-d_\mu(x)$ (at least in expectation) the better $M$ predicts.

4 Posterior Bounds 



Posterior bounds. Both bounds (5) and (6) bound the total (cumulative) discrepancy (error) between $M_t$ and $\mu_t$. Since the discrepancy sum $D_{1:\infty}$ is finite, we know that after a sufficiently long time $t = l$, we will make few further errors, i.e. the future error sum $D_{l:\infty}$ is small. The main goal of this paper is to quantify this asymptotic statement. So we need bounds on $\log_2\frac{\mu(y|x)}{M(y|x)}$, where $x$ are the past and $y$ the future observations. Since $\log_2\frac{\mu(y)}{M(y)} \stackrel{+}{\leq} K(\mu)$ and $\mu(y|x)/M(y|x)$ are conditional versions of true/universal distributions, it seems natural that the unconditional bound $K(\mu)$ also simply conditionalizes to $\log_2\frac{\mu(y|x)}{M(y|x)} \stackrel{+}{\leq} K(\mu|x)$. The more information the past observation $x$ contains about $\mu$, the easier it is to code $\mu$, i.e. the smaller $K(\mu|x)$ is, and hence the fewer future prediction errors $D_{l:\infty}$ we should make. Once $x$ contains all information about $\mu$, i.e. $K(\mu|x) \stackrel{+}{=} 0$, we should make no further errors. More formally, optimally coding $x$, then $\mu|x$, and finally $y|\mu,x$ by Shannon-Fano gives a code for $xy$, hence $K(xy) \stackrel{+}{\leq} K(x) + K(\mu|x) - \log_2\mu(y|x)$. Since $K(z) \approx -\log_2 M(z)$, this implies $\log_2\frac{\mu(y|x)}{M(y|x)} \stackrel{+}{\leq} K(\mu|x)$, but with a logarithmic fudge that tends to infinity as $\ell(y) \to \infty$, which is unacceptable. The $y$-independent bound we need was first stated in [Hut05, Prob.2.6(iii)]:

Theorem 1. For any computable measure $\mu$ and any $x,y \in \mathcal{X}^*$ it holds that

$$\log_2\frac{\mu(y|x)}{M(y|x)} \stackrel{+}{\leq} K(\mu|x) + K(\ell(x)).$$

Proof. For every $l$ we define the following function of $z \in \mathcal{X}^*$. For $\ell(z) \geq l$,

$$\psi^l(z) := \sum_{\nu \in \mathcal{M}} 2^{-K(\nu|z_{1:l})} M(z_{1:l})\, \nu(z_{l+1:\ell(z)}).$$

For $\ell(z) < l$, we extend $\psi^l$ by defining $\psi^l(z) := \sum_{u:\ell(u) = l - \ell(z)} \psi^l(zu)$. It is easy to see that $\psi^l$ is an enumerable semimeasure. By the definition of $M$, we have

$$M(z) \geq 2^{-K(\psi^l)}\psi^l(z)$$

for any $l$ and $z$. Now let $l = \ell(x)$ and $z = xy$. Let us define a computable measure $\mu_x(y) := \mu(y|x)$. Then

$$M(xy) \geq 2^{-K(\psi^l)}\psi^l(xy) \geq 2^{-K(\psi^l)}\, 2^{-K(\mu_x|x)} M(x)\mu_x(y).$$

Taking the logarithm, after trivial transformations, we get

$$\log_2\frac{\mu(y|x)}{M(y|x)} \leq K(\mu_x|x) + K(\psi^l).$$

To complete the proof, let us note that $K(\psi^l) \stackrel{+}{\leq} K(l)$ and $K(\mu_x|x) \stackrel{+}{\leq} K(\mu|x)$. □
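The "trivial transformations" in the proof of Theorem 1 can be spelled out as follows. This is only a sketch: it assumes $M(x) > 0$ and $M(y|x) > 0$, and uses nothing beyond the definition (1) of conditional probability and $\mu_x(y) = \mu(y|x)$.

```latex
\begin{align*}
M(xy) &\geq 2^{-K(\psi^l)}\, 2^{-K(\mu_x|x)}\, M(x)\, \mu_x(y)
  && \text{(lower bound on } M(xy) \text{ from the proof)} \\
\frac{M(xy)}{M(x)} &\geq 2^{-K(\psi^l) - K(\mu_x|x)}\, \mu(y|x)
  && \text{(divide by } M(x) > 0,\ \mu_x(y) = \mu(y|x)\text{)} \\
\frac{\mu(y|x)}{M(y|x)} &\leq 2^{K(\psi^l) + K(\mu_x|x)}
  && \text{(use } M(y|x) = M(xy)/M(x) \text{ by (1))} \\
\log_2\frac{\mu(y|x)}{M(y|x)} &\leq K(\psi^l) + K(\mu_x|x)
  && \text{(take } \log_2\text{)}
\end{align*}
```

The additive constants enter only in the last step of the proof, where $K(\psi^l)$ and $K(\mu_x|x)$ are replaced by $K(l)$ and $K(\mu|x)$.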

Corollary 2. The future deviation of $M_t$ from $\mu_t$ is bounded by

$$\sum_{t=l+1}^{\infty} \mathbf{E}[s_t|\omega_{1:l}] \leq D_{l+1:\infty}(\omega_{1:l}) \stackrel{+}{\leq} \big(K(\mu|\omega_{1:l}) + K(l)\big)\ln 2. \qquad (i)$$

For $s_t$ being squared (absolute) distance, Hellinger distance, or squared Bayes regret, the total deviation of $M_t$ from $\mu_t$ is bounded by

$$\sum_{t=1}^{\infty} \mathbf{E}[s_t] \stackrel{+}{\leq} \min_l \big\{ \mathbf{E}[K(\mu|\omega_{1:l}) + K(l)]\ln 2 + 2l \big\}. \qquad (ii)$$

Proof. (i) The first inequality is (2) and the second follows by taking the conditional expectation $\mathbf{E}[\cdot|\omega_{1:l}]$ in Theorem 1. (ii) follows from (i) by taking the unconditional expectation and from $\sum_{t=1}^{l} \mathbf{E}[s_t] \leq 2l$, since $s_t \leq 2$ for these distances [Hut05]. □

Examples and more motivation. The bounds Theorem 1 and Corollary 2(i) prove and quantify the intuition that the more we know about the environment, the better our predictions. We show the usefulness of the new bounds for some deterministic environments $\mu = \alpha$.

Assume all observations are identical, i.e. $\alpha = x_1x_1x_1...$. Further assume that $\mathcal{X}$ is huge and $K(x_1) \stackrel{+}{=} \log_2|\mathcal{X}|$, i.e. $x_1$ is a typical/random/complex element of $\mathcal{X}$. For instance if $x_1$ is a $256^3$ color $512 \times 512$ pixel image, then $|\mathcal{X}| = 256^{3 \times 512 \times 512}$. Hence the standard bound (6) on the number of errors $D_{1:\infty}/\ln 2 \stackrel{+}{\leq} K(\mu) \stackrel{+}{=} K(x_1) \stackrel{+}{=} 3 \cdot 2^{21}$ is huge. Of course, interesting pictures are not purely random, but their complexity is often only a factor 10..100 less, so still large. On the other hand, any reasonable prediction scheme observing a few (rather than several thousands) identical images should predict that the next image will be the same. This is what our posterior bound gives, $D_{2:\infty}(x_1) \stackrel{+}{\leq} (K(\mu|x_1) + K(1))\ln 2 \stackrel{+}{=} 0$, hence indeed $M$ makes only $\sum_{t=1}^{\infty} \mathbf{E}[s_t] = O(1)$ errors by Corollary 2(ii), significantly improving upon Solomonoff's bound $K(x_1)\ln 2$.

More generally, assume $\alpha = x\omega$, where the initial part $x = x_{1:l}$ contains all information about the remainder, i.e. $K(\mu|x) \stackrel{+}{=} K(\omega|x) \stackrel{+}{=} 0$. For instance, $x$ may be a binary program for $\pi$ or $e$, and $\omega$ its $|\mathcal{X}|$-ary expansion. Sure, given the algorithm for some number sequence, it should be perfectly predictable. Indeed, Theorem 1 implies $D_{l+1:\infty} \stackrel{+}{\leq} K(l)$, which can be exponentially smaller than Solomonoff's bound $K(\mu)$ ($\stackrel{+}{=} l$ if $K(x) \stackrel{+}{=} \ell(x)$). On the other hand, $K(l) \geq \log_2 l$ for most $l$, i.e. is larger than the $O(1)$ that one might hope for.
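The numbers in the image example are easy to verify. The sketch below recomputes $\log_2|\mathcal{X}|$ for the image alphabet and, for comparison, the upper bound $2\log_2 l$ on $K(l)$ from Section 2 (valid only up to an additive constant). The function names and the sample value of $l$ are illustrative assumptions.

```python
import math


def log2_alphabet_size(levels=256, channels=3, width=512, height=512):
    """log2 |X| when each symbol is an image with `channels` colour channels,
    `levels` intensity values per channel and width x height pixels."""
    return channels * width * height * math.log2(levels)


def k_length_upper_bound(l):
    """Upper bound 2*log2(l) on K(l), ignoring the additive constant."""
    return 2 * math.log2(l)


print(log2_alphabet_size())            # bits needed for a typical image x_1
print(3 * 2 ** 21)                     # the value quoted in the text
print(k_length_upper_bound(2 ** 20))   # logarithmic overhead for l = 2^20
```

The contrast between the two printed magnitudes is the point of the example: the total bound scales with the information content of one image, the posterior bound only with the complexity of the position $l$.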
Logarithmic versus constant accuracy. Thus there is one blemish in the bound: an additive correction of logarithmic size in the length of $x$. Many theorems in algorithmic information theory hold to within an additive constant; sometimes this is easily reached, sometimes with difficulty, sometimes one needs a suitable complexity variant, and sometimes the logarithmic accuracy cannot be improved [LV97]. The latter is the case with Theorem 1:

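Lemma 3. For $\mathcal{X} = \{0,1\}$, for any positive computable measure $\mu$, there exists a computable sequence $\alpha \in \{0,1\}^\infty$ such that for any $l \in \mathbb{N}$

$$D_{l:\infty}(\alpha_{<l}) \geq D_{l:l}(\alpha_{<l}) = \sum_{b \in \{0,1\}} \mu(b|\alpha_{<l}) \ln\frac{\mu(b|\alpha_{<l})}{M(b|\alpha_{<l})} \stackrel{+}{\geq} \frac{1}{3}K(l).$$

Proof. Let us construct such a computable sequence $\alpha \in \{0,1\}^\infty$ by induction. Assume that $\alpha_{<l}$ is constructed. Since $\mu$ is a measure, either $\mu(0|\alpha_{<l}) > c$ or $\mu(1|\alpha_{<l}) > c$ for $c := [3\ln 2]^{-1} < \frac{1}{2}$. Since $\mu$ is computable, we can find (effectively) $b \in \{0,1\}$ such that $\mu(b|\alpha_{<l}) > c$. Put $\alpha_l = \bar b$.

Let us estimate $M(\bar\alpha_l|\alpha_{<l})$. Since $\alpha$ is computable, $M(\alpha_{<l}) \stackrel{\times}{\geq} 1$. We claim that $M(\alpha_{<l}\bar\alpha_l) \stackrel{\times}{\leq} 2^{-K(l)}$. Indeed, consider the set $\{\alpha_{<l}\bar\alpha_l \mid l > 0\}$. This set is prefix free and decidable. Therefore $P(l) = M(\alpha_{<l}\bar\alpha_l)$ is an enumerable function with $\sum_l P(l) \leq 1$, and the claim follows from the coding theorem. Thus, we have $M(\bar\alpha_l|\alpha_{<l}) \stackrel{\times}{\leq} 2^{-K(l)}$ for any $l$. Since $\mu(\bar\alpha_l|\alpha_{<l}) > c$, we get

$$\sum_{b \in \{0,1\}} \mu(b|\alpha_{<l}) \ln\frac{\mu(b|\alpha_{<l})}{M(b|\alpha_{<l})} \geq \mu(\bar\alpha_l|\alpha_{<l}) \ln\frac{c}{M(\bar\alpha_l|\alpha_{<l})} + \min_{p \in [0,1-c]} p\ln\frac{p}{M(\alpha_l|\alpha_{<l})} \stackrel{+}{\geq} c\,K(l)\ln 2.$$

□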

A constant fudge is generally preferable to a logarithmic one for quantitative and aesthetic reasons. It also often leads to particular insight and/or interesting new complexity variants (which will be the case here). Though most complexity variants coincide within logarithmic accuracy (see [Sch00, Sch02a] for exceptions), they can have very different other properties. For instance, Solomonoff complexity $KM(x) = -\log_2 M(x)$ is an excellent predictor, but monotone complexity $Km$ can be exponentially worse and prefix complexity $K$ fails completely [Hut03c, Hut06].

Exponential bounds. Bayes is often approximated by MAP or MDL. In our context this means approximating $KM$ by $Km$, with exponentially worse bounds (in deterministic environments) [Hut03c]. (Intuitively, an error with Bayes eliminates half of the environments, while MAP/MDL may eliminate only one.) Also for more complex "reinforcement" learning problems, bounds can be $2^{K(\mu)}$ rather than $K(\mu)$ due to sparser feedback. For instance, for a sequence $x_1x_1x_1...$, if we do not observe $x_1$ but only receive a reward if our prediction was correct, then the only way a universal predictor can find $x_1$ is by trying out all $|\mathcal{X}|$ possibilities and making (in the worst case) $|\mathcal{X}| - 1 \stackrel{\times}{=} 2^{K(\mu)}$ errors. Posterization allows us to boost such gross bounds to useful bounds $2^{K(\mu|x_1)} = O(1)$. But in general, additive logarithmic corrections as in Theorem 1 also exponentiate and lead to bounds polynomial in $l$, which may be quite sizeable. Here the advantage of a constant correction becomes even more apparent [Hut05, Problems 2.6, 3.13, 6.3 and Section 5.3.3].



5 More Bounds and New Complexity Measure 

Lemma 3 shows that the bound in Theorem 1 is attained for some binary strings. But for other binary strings the bound may be very rough. (Similarly, $K(x)$ is greater than $\ell(x)$ infinitely often, but $K(x) \ll \ell(x)$ for many "interesting" $x$.) Let us try to find a new bound, which does not depend on $\ell(x)$.

First observe that, in contrast to the unconditional case (6), $K(\mu)$ is not an upper bound (again by Lemma 3). Informally speaking, the reason is that $M$ can predict the future very badly if the past is not "typical" for the environment (such past $x$ have low $\mu$-probability, therefore in the unconditional case their contribution to the expected loss is small). So, it is natural to bound the loss in terms of randomness deficiency $d_\mu(x)$, which is a quantitative measure of "typicalness".

Theorem 4. For any computable measure $\mu$ and any $x,y \in \{0,1\}^*$ it holds

$$\log_2\frac{\mu(y|x)}{M(y|x)} \stackrel{+}{\leq} K(\mu) + K(\lceil d_\mu(x) \rceil).$$

Theorem 4 is a variant of the "deficiency conservation theorem" from [VSU05]. We do not know who was the first to discover this statement and whether it was published (the special case where $\mu$ is the uniform measure was proved by An. Muchnik as an auxiliary lemma for one of his unpublished results; then A. Shen included a generalized statement in the (unfinished) book [VSU05]).

Now, our goal is to replace $K(\mu)$ in the last bound by a conditional complexity of $\mu$. Unfortunately, the conventional conditional prefix complexity is not suitable:

Lemma 5. Let $\mathcal{X} = \{0,1\}$. There is a constant $C_0$ such that for any $l \in \mathbb{N}$, there are a computable measure $\mu$ and $x \in \{0,1\}^l$ such that

$$K(\mu|x) \leq C_0, \quad |d_\mu(x)| \leq C_0, \quad \text{and}$$

$$D_{l+1:l+1}(x) = \sum_{b \in \{0,1\}} \mu(b|x)\ln\frac{\mu(b|x)}{M(b|x)} \stackrel{+}{\geq} K(l)\ln 2.$$

Proof. For $l \in \mathbb{N}$, define a deterministic measure $\mu^l$ such that $\mu^l$ is equal to 1 on the prefixes of $0^l1^\infty$ and is equal to 0 otherwise.

Let $x = 0^l$. Then $\mu^l(x) = 1$, $\mu^l(x0) = 0$, $\mu^l(x1) = 1$. Also $1 \geq M(x) \geq M(x0) \geq M(0^\infty) \stackrel{\times}{=} 1$ and (as in the proof of Lemma 3) $M(x1) \stackrel{\times}{\leq} 2^{-K(l)}$. Trivially, $d_{\mu^l}(x) = \log_2 M(x) \stackrel{+}{=} 0$, and $K(\mu^l|x) \stackrel{+}{=} K(\mu^l|l) \stackrel{+}{=} 0$. Thus, $K(\mu^l|x)$ and $d_{\mu^l}(x)$ are bounded by a constant $C_0$ independent of $l$. On the other hand,

$$\sum_{b \in \{0,1\}} \mu^l(b|x)\ln\frac{\mu^l(b|x)}{M(b|x)} = \ln\frac{1}{M(1|x)} \stackrel{+}{\geq} K(l)\ln 2.$$

(One can obtain the same result also for non-deterministic $\mu$, for example, taking $\mu^l$ mixed with the uniform measure.) □

Informally speaking, in Lemma 5 we exploit the fact that $K(y|x)$ can use the information about the length of the condition $x$. Hence $K(y|x)$ can be small for a certain $x$ and large for some (actually almost all) prolongations of $x$. But in our case of sequence prediction, the length of $x$ grows taking all intermediate values and cannot contain any relevant information. Thus we need a new kind of conditional complexity.

Consider a Turing machine $T$ with two input tapes. Inputs are provided without delimiters, so the size of the input is defined by the machine itself. Let us call such a machine twice prefix. We write that $T(x,y) = z$ if machine $T$, given a sequence beginning with $x$ on the first tape and a sequence beginning with $y$ on the second tape, halts after reading exactly $x$ and $y$ and prints $z$ to the output tape. (Obviously, if $T(x,y) = z$, then the computation does not depend on the contents of the input tapes after $x$ and $y$.) We define

$$C_T(y|x) := \min\{\ell(p) \mid \exists k \leq \ell(x) : T(p,x_{1:k}) = y\}.$$

Clearly, $C_T(y|x)$ is a function of $T$, $x$, and $y$ that is enumerable from above. Using a standard argument [LV97], one can show that there exists an optimal twice prefix machine $U$ in the sense that for any twice prefix machine $T$

$$C_U(y|x) \stackrel{+}{\leq} C_T(y|x).$$

Definition 6. Complexity monotone in conditions is defined for some fixed optimal twice prefix machine $U$ as

$$K_*(y|x*) := C_U(y|x) = \min\{\ell(p) \mid \exists k \leq \ell(x) : U(p,x_{1:k}) = y\}.$$

Here $*$ in $x*$ is a syntactical part of the complexity notation $K_*(\cdot|\cdot*)$, though one may think of $K_*(y|x*)$ as the minimal length of a program that produces $y$ given any $z = x*$.
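The definition of $C_T$ can be illustrated with a toy twice prefix machine given as a finite lookup table. The table below is a hypothetical example invented for this sketch (a real twice prefix machine is a Turing machine, and $K_*$ itself is not computable); it only shows how the minimum over programs and over prefixes of the condition is taken, and why prolonging the condition can never increase the value.

```python
import math
from itertools import product

# Hypothetical toy machine: (program read from tape 1, prefix read from tape 2) -> output.
# It reads the first bit of tape 2; on "1" it outputs "a" without reading tape 1,
# on "0" it reads one program bit and outputs "a" for "0" and "b" for "1".
TOY_MACHINE = {
    ("", "1"): "a",
    ("0", "0"): "a",
    ("1", "0"): "b",
}


def C_T(y, x, machine=TOY_MACHINE):
    """min{ len(p) : there is k <= len(x) with T(p, x[:k]) = y }, or inf if none."""
    best = math.inf
    for (p, x_read), out in machine.items():
        # x.startswith(x_read) holds exactly when x_read = x[:k] for some k <= len(x).
        if out == y and x.startswith(x_read):
            best = min(best, len(p))
    return best


def all_strings(max_len):
    """All binary strings of length 0..max_len."""
    return ["".join(b) for n in range(max_len + 1) for b in product("01", repeat=n)]


print(C_T("a", ""), C_T("a", "1"), C_T("a", "0"), C_T("a", "10"))
```

Because every prefix of $x$ is also a prefix of $xz$, any program that works for $x$ still works for $xz$, which is the monotonicity that the conventional conditional complexity $K(y|x)$ lacks.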

Theorem 7. For any computable measure $\mu$ and any $x,y \in \mathcal{X}^*$ it holds

$$\log_2\frac{\mu(y|x)}{M(y|x)} \stackrel{+}{\leq} K_*(\mu|x*) + K(\lceil d_\mu(x) \rceil).$$

Note. One can get slightly stronger variants of Theorems 1 and 7 by replacing the complexity of a standard code of $\mu$ by more sophisticated values. First, in any effective encoding there are many codes for every $\mu$, and in all the upper bounds (including Solomonoff's one) one can take the minimum of the complexities of all the codes for $\mu$. Moreover, in Theorem 1 it is sufficient to take the complexity of $\mu_x = \mu(\cdot|x)$ (and it is sufficient that $\mu_x$ is enumerable, while $\mu$ can be incomputable). For Theorem 7 one can prove a similar strengthening: the complexity of $\mu$ is replaced by the complexity of any computable function that is equal to $\mu$ on all prefixes and prolongations of $x$.

To demonstrate the usefulness of the new bound, let us again consider some deterministic environment $\mu = \alpha$. For $\mathcal{X} = \{0,1\}$ and $\alpha = x^\infty$ with $x = 0^n1$, Theorem 1 gives the bound $K(\mu|n) + K(n) \stackrel{+}{=} K(n)$. Consider the new bound $K_*(\mu|x*) + K(\lceil d_\mu(x) \rceil)$. Since $\mu$ is deterministic, we have $d_\mu(x) = \log_2 M(x) \stackrel{+}{=} -K(n)$, and $K(\lceil d_\mu(x) \rceil) \stackrel{+}{=} K(K(n))$. To estimate $K_*(\mu|x*)$, let us consider a machine $T$ that reads only its second tape and outputs the number of 0s before the first 1. Clearly, $C_T(n|x) = 0$, hence $K_*(\mu|x*) \stackrel{+}{=} 0$. Finally, $K_*(\mu|x*) + K(\lceil d_\mu(x) \rceil) \stackrel{+}{\leq} K(K(n))$, which is much smaller than $K(n)$.
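The machine $T$ used in this example is simple enough to write down. The following sketch models the second tape as a Python string and returns both the output $n$ and the number of symbols actually read; the function name and the return convention are assumptions made for illustration.

```python
def count_zeros_machine(second_tape):
    """Read the second tape up to and including the first '1'.

    Returns (n, symbols_read), where n is the number of 0s before the first 1,
    or None if the tape ends before a 1 is seen (the machine would not halt).
    The first tape is never read, so the program has length 0.
    """
    n = 0
    for position, bit in enumerate(second_tape):
        if bit == "1":
            return n, position + 1
        n += 1
    return None


x = "0001"                                   # x = 0^n 1 with n = 3
print(count_zeros_machine(x))                # output on x itself
print(count_zeros_machine(x + "0001" * 3))   # same output on a prolongation of x
print(count_zeros_machine("000"))            # a proper prefix of x is not enough
```

Since the machine stops at the first 1, its output is the same on every prolongation of $x$, which is exactly the consistency that the definition of $K_*$ requires.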

6 Properties of the New Complexity 

The above definition of $K_*$ is based on computations of some Turing machine. Such definitions are quite visual, but are often not convenient for formal proofs. We will give an alternative definition in terms of enumerable sets (see [US96] for definitions of unconditional complexities in this style), which summarizes the properties we actually need for the proof of Theorem 7.

An enumerable set $E$ of triples of strings is called $K_*$-correct if it satisfies the following requirements:

1. if $(p,x,y_1) \in E$ and $(p,x,y_2) \in E$, then $y_1 = y_2$;

2. if $(p,x,y) \in E$, then $(p',x',y) \in E$ for all $p'$ being prolongations of $p$ and all $x'$ being prolongations of $x$;

3. if $(p,x',y) \in E$ and $(p',x,y) \in E$, and $p$ is a prefix of $p'$ and $x$ is a prefix of $x'$, then $(p,x,y) \in E$.

A complexity of $y$ under a condition $x$ w.r.t. a set $E$ is

$$C_E(y|x) = \min\{\ell(p) \mid (p,x,y) \in E\}.$$

A $K_*$-correct set $E$ is called optimal if

$$C_E(y|x) \stackrel{+}{\leq} C_{E'}(y|x)$$

for any $K_*$-correct set $E'$. One can easily construct an enumeration of all $K_*$-correct sets, and an optimal set exists by the standard argument.

It is easy to see that a twice prefix Turing machine $T$ can be transformed to a set $E$ such that $C_T(y|x) = C_E(y|x)$. The set $E$ is constructed as follows: $T$ is run on all possible inputs, and if $T(p,x) = y$, then triples $(p',x',y)$ are added to $E$ for all $p'$ being prolongations of $p$ and all $x'$ being prolongations of $x$. Evidently, $E$ is enumerable, and the second requirement of $K_*$-correctness is satisfied. To verify the other requirements, let us consider arbitrary $(p_1',x_1',y_1) \in E$ and $(p_2',x_2',y_2) \in E$ such that $p_1'$ and $p_2'$, $x_1'$ and $x_2'$ are comparable (one is a prefix of the other). By construction of $E$, there are $p_i$ being prefixes of $p_i'$ and $x_i$ being prefixes of $x_i'$ such that $T(p_i,x_i) = y_i$ for $i = 1,2$. Clearly, $p_1$ and $p_2$, $x_1$ and $x_2$ are comparable too. Since replacing the unused part of the inputs does not affect the running of the machine $T$ and comparable words have a common prolongation, we get $p_1 = p_2$, $x_1 = x_2$, and $y_1 = y_2$. Thus $E$ is a $K_*$-correct set.

The transformation in the other direction is impossible in some cases: the set $E = \{(0^{h(n)}p, 0^n1q, 0) \mid n \in \mathbb{N},\ p,q \in \{0,1\}^*\}$, where $h(n)$ is 0 if the $n$-th Turing machine halts and 1 otherwise, is $K_*$-correct, but does not have a corresponding machine $T$. (Assume that such a machine $T$ exists. If the $n$-th machine halts, then $(\epsilon, 0^n1, 0) \in E$ and thus $T$ does not read the input tape at all. If the $n$-th machine does not halt, then $(0, 0^n1, 0) \in E$ and $(1, 0^n1, 0) \notin E$ and thus $T$ has to read the first symbol on the input tape. Therefore, one can use $T$ to solve the halting problem.) However, we conjecture (but cannot prove) that for every set $E$ there exists a machine $T$ such that $C_T(x|y) \stackrel{+}{=} C_E(x|y)$.

Probably, the requirements on $E$ can be even weaker, namely, the third requirement might be superfluous. Let us notice that the first requirement of $K_*$-correctness allows us to consider the set $E$ as a partial computable function: $E(p,x)=y$ iff $(p,x,y)\in E$. The second requirement says that $E$ becomes a continuous function if we take the topology of prolongations (any neighborhood of $(p,x)$ contains the cone $\{(p*,x*)\}$) on the arguments and the discrete topology ($\{y\}$ is a neighborhood of $y$) on values. It is known (see [US96] for references) that different complexities (plain, prefix, decision) can be naturally defined in a similar "topological" fashion. We conjecture the same is true in our case: an optimal enumerable set satisfying the requirements (1) and (2) (obviously, it exists) specifies the same complexity (up to an additive constant) as an optimal twice prefix machine.

It follows immediately from the definition(s) that $K_*(y|x*)$ is monotone as a function of $x$: $K_*(y|xz*)\le K_*(y|x*)$ for all $x$, $y$, $z$.

The following lemma provides bounds for $K_*(x|y*)$ in terms of prefix complexity $K$. The lemma holds for all the definitions of $K_*(x|y*)$ above.

Lemma 8. For any $x,y\in X^*$ it holds

$$K(x|y) \stackrel{+}{\leq} K_*(x|y*) \stackrel{+}{\leq} \min_{l\le\ell(y)}\{K(x|y_{1:l})+K(l)\} \stackrel{+}{\leq} K(x).$$

In general, none of the bounds are equal to $K_*(x|y*)$ even within a $o(K(x))$ term, but they are attained for certain $y$: For every $x$ there is a $y$ such that

$$K(x|y)\stackrel{+}{=}0 \quad\text{and}\quad K_*(x|y*) \stackrel{+}{=} \min_{l\le\ell(y)}\{K(x|y_{1:l})+K(l)\} \stackrel{+}{=} K(x),$$

and for every $x$ there is a $y$ such that

$$K(x|y)\stackrel{+}{=}K_*(x|y*)\stackrel{+}{=}0 \quad\text{and}\quad \min_{l\le\ell(y)}\{K(x|y_{1:l})+K(l)\} \stackrel{+}{=} K(x).$$

Proof. The first inequality is trivial (any twice-prefix machine is also a prefix machine in the first argument), as well as the last one (consider $l=0$). Let us describe a twice prefix machine that provides $K_*(x|y*)\stackrel{+}{\leq}\min_{l\le\ell(y)}\{K(x|y_{1:l})+K(l)\}$. The first tape contains a prefix code $p_l$ of $l$ followed by a prefix code $p$ for $x$ under condition $y_{1:l}$, and the second tape contains $y$. The machine reads $p_l$ on the first tape and reconstructs the number $l$, then reads $l$ bits from the second tape, and then reads $p$ using these bits as the condition. Thus, $K_*(x|y*)\stackrel{+}{\leq}\ell(p_l)+\ell(p)\stackrel{+}{\leq}K(l)+K(x|y_{1:l})$.
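The machine just described can be mimicked by a small sketch. Everything concrete here is our own assumption: an Elias-gamma style code stands in for the prefix code of $l$, and `toy_conditional_decoder` is a hypothetical stand-in for the optimal conditional decoder (it can only copy the condition or read a literal string). The point of the sketch is only the reading discipline: the output depends on a prefix of each tape, so prolonging either tape does not change it.

```python
def encode_number(n):
    """Self-delimiting (Elias-gamma style) code for n >= 0; stand-in for a prefix code of l."""
    b = bin(n + 1)[2:]
    return "0" * (len(b) - 1) + b

def decode_number(tape, pos):
    """Read one encoded number starting at pos; return (n, new_pos), or None if the tape is too short."""
    zeros = 0
    while pos + zeros < len(tape) and tape[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    if end > len(tape):
        return None
    return int(tape[pos + zeros:end], 2) - 1, end

def toy_conditional_decoder(tape, pos, cond):
    """Hypothetical decoder: flag '1' outputs the condition; flag '0' is followed by a literal."""
    if pos >= len(tape):
        return None
    if tape[pos] == "1":
        return cond
    res = decode_number(tape, pos + 1)
    if res is None:
        return None
    m, start = res
    if start + m > len(tape):
        return None
    return tape[start:start + m]

def twice_prefix_machine(tape1, tape2):
    """Read p_l from tape1, then l bits of tape2, then decode the rest of tape1 under that condition."""
    res = decode_number(tape1, 0)
    if res is None:
        return None
    l, pos = res
    if l > len(tape2):
        return None  # not enough condition bits available: no output
    return toy_conditional_decoder(tape1, pos, tape2[:l])

# Program using l = 4 condition bits, and a program using l = 0 (no condition at all).
prog_with_condition = encode_number(4) + "1"
prog_without_condition = encode_number(0) + "0" + encode_number(4) + "1011"
```

In this toy setting the program that uses the condition is shorter than the unconditional one, which mirrors the trade-off between $K(x|y_{1:l})+K(l)$ at different $l$ in the lemma.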
Let us show that the bounds are attained. 

Let us observe that $K(x)\stackrel{+}{\leq}K_*(x|0^n*)$ for all $x$ and $n$. Indeed, let $P(x)=\max\{2^{-\ell(p)} \mid \exists n\ (p,0^n,x)\in E\}$ (which implies $-\log_2P(x)\le K_*(x|0^n*)$ for all $n$). Obviously, $P(x)$ is enumerable. Further, $\sum_xP(x)\le 1$ since $\sum_xP(x)$ is a sum of $2^{-\ell(p)}$ over a prefix-free set of $p$. (Assume the converse: $p$ is a prefix of $q$, and $(p,0^n,x)\in E$, $(q,0^m,y)\in E$ for some $n$, $m$, and different $x$, $y$. By the second requirement of $K_*$-correctness, $(q,0^{\max\{m,n\}},x)\in E$ and $(q,0^{\max\{m,n\}},y)\in E$. By the first requirement, $x=y$, a contradiction.) Thus, by the coding theorem, $K(x)\stackrel{+}{\leq}-\log_2P(x)\le K_*(x|0^n*)$.

To get the first example, for arbitrary $x$, let us take $y=0^n$ such that $n$ is the number of $x$ in some ordering of all binary strings. Then $K(x|y)\stackrel{+}{=}K(x|n)\stackrel{+}{=}0$, $K_*(x|y*)\stackrel{+}{=}K(x)$, and we have $\min_l\{K(x|y_{1:l})+K(l)\}\stackrel{+}{=}K(x)$ since $K_*(x|y*)\stackrel{+}{\leq}\min_l\{K(x|y_{1:l})+K(l)\}\stackrel{+}{\leq}K(x|n)+K(n)\stackrel{+}{=}K(x)$.

To get the second example, for an arbitrary $x$ let us take $n$ such that $K(l)>K(x)$ for all $l>n$. Then put $y=0^n1\bar{x}$, where $\bar{x}$ is any prefix code of $x$ (e.g., $\bar{x}=0^{\ell(x)}1x$). Obviously, $K(x|y)\stackrel{+}{=}0$ and $K_*(x|y*)\stackrel{+}{=}0$. Consider $K(x|y_{1:l})+K(l)$. If $l\le n$, then it is equal to $K(x|0^l)+K(l)\stackrel{+}{\geq}K(\langle x,l\rangle)\stackrel{+}{\geq}K(x)$. If $l>n$, then $K(l)>K(x)$ by definition of $n$. □

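Corollary 9. The future deviation of $M_t$ from $\mu_t$ is bounded by

$$\sum_{t=l+1}^{\infty}\mathbf{E}[s_t|\omega_{1:l}] \stackrel{+}{\leq} \Big[\min_{i\le l}\{K(\mu|\omega_{1:i})+K(i)\} + K(\lceil d_\mu(\omega_{1:l})\rceil)\Big]\ln 2.$$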

Let us note that if $\omega$ is $\mu$-random, then $K(\lceil d_\mu(\omega_{1:l})\rceil)\stackrel{+}{\leq}K(\lceil d_\mu(\omega_{1:\infty})\rceil)+K(K(\mu))$, and therefore we get a bound that does not increase with $l$, in contrast to the bound (i) in Corollary 2.

Finally, let us point out one more approach to defining the complexity $K_*$. The survey [US96] provides "encoding-free" definitions of the main complexities. In a similar fashion, $K_*$ could be defined as a minimal (up to an additive constant) function with the following properties:

1. The function $K_*(y|x*)$ is non-negative and co-enumerable;

2. $K_*(y|xz*)\le K_*(y|x*)$ for all $x$, $y$, $z$;

3. $\sum_y 2^{-K_*(y|x*)}\le 1$ for all $x$.

Probably, condition 2 expressing strict monotonicity is superfluous, and both conditions 2 and 3 can be replaced by

2'. For any set $A=\{(x,y)\}$ such that all the first elements $x$ of the pairs from $A$ have a common prolongation and the second elements $y$ are different for all pairs from $A$, it holds $\sum_{(x,y)\in A}2^{-K_*(y|x*)}\le 1$.

It is easy to check that these properties are satisfied for all previously defined "versions" of $K_*$. We conjecture that all the definitions are equivalent, though we cannot prove this.

7 Proof of Theorem 7 

If $\mu(x)=0$, then $d_\mu(x)=\infty$ and the bound trivially holds. Below assume that $\mu(x)\ne 0$ and thus $d_\mu(x)$ is finite.

The plan is to get a statement of the form $2^d\mu(xy)\stackrel{\times}{\leq}M(xy)$, where $d\approx d_\mu(x)=\log_2\frac{M(x)}{\mu(x)}$. To this end, we define a new semimeasure $\nu$: we take the set $S=\{z \mid d_\mu(z)>d\}$ and put $\nu$ to be $2^d\mu$ on prolongations of $z\in S$; this is possible since $S$ has $\mu$-measure $2^{-d}$. Then we have $\nu(z)\le C\cdot M(z)$ by universality of $M$. However, the constant $C$ depends on $\mu$ and also on $d$. To make the dependence explicit, we repeat the above construction for all numbers $d$ and all semimeasures $\mu^T$, obtaining semimeasures $\nu_{d,T}$, and take $\nu=\sum_{d,T}2^{-K(d)}\cdot 2^{-K(T)}\nu_{d,T}$. This construction would give us the term $K(\mu)$ in the right-hand side of Theorem 7. To get $K_*(\mu|x*)$, we need a more complicated strategy: instead of a sum of semimeasures $\nu_{d,T}$, for every fixed $d$ we sum "pieces" of $\nu_{d,T}$ at each point $z$, with coefficients depending on $z$ as well as on $d$ and $T$.

Now we proceed with the formal proof. Let $\{\mu^T\}_{T\in\mathbb{N}}$ be any (effective) enumeration of all enumerable semimeasures. For any integer $d$ and any $T$, put

$$S_{d,T} := \Big\{z \;\Big|\; \sum_{v\in X^{\ell(z)}\setminus\{z\}}\mu^T(v) + 2^{-d}M(z) > 1\Big\}.$$

The set $S_{d,T}$ is enumerable given $d$ and $T$.

Let $E$ be the optimal $K_*$-correct set (satisfying all three requirements), and let $E(p,z)$ be the corresponding partial computable function. For any $z\in X^*$ and $T$, put

$$\lambda_{d,T}(z) := \max\{2^{-\ell(p)} \mid \exists k\le\ell(z):\ z_{1:k}\in S_{d,T}\ \text{and}\ E(p,z_{1:k})=T\}$$

(if there is no such $p$, then $\lambda_{d,T}(z)=0$). Put

$$\tilde\nu_d(z) := \sum_T \lambda_{d,T}(z)\cdot 2^d\mu^T(z).$$

Obviously, this value is enumerable. It is not a semimeasure, but it has the following property.

Claim 10. For any prefix-free set $A$,

$$\sum_{z\in A}\tilde\nu_d(z)\le 1.$$

This implies that there exists an enumerable semimeasure $\nu_d$ such that $\nu_d(z)\ge\tilde\nu_d(z)$ for all $z$. Indeed, to enumerate $\nu_d$, one enumerates $\tilde\nu_d(z)$ for all $z$, and at each step the current approximation of $\nu_d(z)$ is the maximum of the current approximations of $\tilde\nu_d(z)$ and $\sum_{u\in X}\nu_d(zu)$. Trivially, this provides $\nu_d(z)\ge\sum_{u\in X}\nu_d(zu)$. To show that $\nu_d(\epsilon)\le 1$, let us note that at any step of enumeration the current approximation of $\nu_d(\epsilon)$ is the sum of current approximations of $\tilde\nu_d(z)$ over some prefix-free set, and thus is bounded by 1. Put

$$\nu(z) := \sum_d 2^{-K(d)}\nu_d(z).$$
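The step from a function with bounded prefix-free sums to a dominating semimeasure can be illustrated on a finite tree. This is a sketch under our own assumptions: a hypothetical table `nu_tilde` with finitely many nonzero values replaces the enumeration, and the function name `dominate` is ours. Each node receives the maximum of its own value and the sum over its children, exactly as in the paragraph above.

```python
from fractions import Fraction

def dominate(nu_tilde, depth, alphabet="01"):
    """Return nu with nu(z) = max(nu_tilde(z), sum over children of nu) for all len(z) <= depth."""
    nu = {}

    def value(z):
        own = nu_tilde.get(z, Fraction(0))
        if len(z) < depth:
            children = sum((value(z + a) for a in alphabet), Fraction(0))
        else:
            children = Fraction(0)  # nu_tilde is assumed to vanish below the depth bound
        nu[z] = max(own, children)
        return nu[z]

    value("")
    return nu

# Hypothetical values: every prefix-free set has total weight at most 1,
# but the table is not a semimeasure (e.g. the value at "1" is below the sum of its children).
nu_tilde = {
    "": Fraction(1, 4),
    "0": Fraction(1, 2),
    "1": Fraction(1, 8),
    "10": Fraction(1, 4),
    "11": Fraction(1, 4),
}
nu = dominate(nu_tilde, 2)
```

In this example the value at the root is attained by the prefix-free set {"0", "10", "11"}, which is why it cannot exceed 1.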

Clearly, $\nu$ is an enumerable semimeasure, thus $\nu(z)\stackrel{\times}{\leq}M(z)$. Let $\mu$ be an arbitrary computable measure, and $x,y\in X^*$. Let $p\in\{0,1\}^*$ be a string such that $K_*(\mu|x*)=\ell(p)$, $E(p,x)=T$, and $\mu=\mu^T$. Put $d=\lceil d_\mu(x)\rceil-1$, i.e., $d_\mu(x)-1\le d<d_\mu(x)$. Hence $\mu(x)<2^{-d}M(x)$. Since $\mu=\mu^T$ is a measure, we have $\sum_{v\in X^{\ell(x)}}\mu^T(v)=1$, and therefore $x\in S_{d,T}$. By definition, $\lambda_{d,T}(xy)\ge 2^{-\ell(p)}$, thus $\tilde\nu_d(xy)\ge 2^{-\ell(p)}2^d\mu(xy)$, and

$$2^{-K(d)}\,2^{-\ell(p)}\,2^d\mu(xy) \le \nu(xy) \stackrel{\times}{\leq} M(xy).$$

Replacing $2^d$ in the left-hand side by a smaller value $2^{d_\mu(x)-1}$, after trivial transformations we get

$$\log_2\frac{\mu(y|x)}{M(y|x)} \stackrel{+}{\leq} K_*(\mu|x*)+K(d),$$

which completes the proof of Theorem 7.
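For the reader's convenience, the "trivial transformations" can be spelled out as follows. This is our own reading of the step, not part of the original proof: $c$ denotes the multiplicative constant hidden in $\stackrel{\times}{\leq}$, and we assume the usual conventions $\mu(y|x)=\mu(xy)/\mu(x)$, $M(y|x)=M(xy)/M(x)$, and $2^{d_\mu(x)}=M(x)/\mu(x)$.

```latex
\begin{align*}
2^{-K(d)}\,2^{-\ell(p)}\,2^{d_\mu(x)-1}\,\mu(xy) &\le c\,M(xy) \\
\frac{M(x)}{\mu(x)}\,\mu(xy) &\le 2c\;2^{K(d)+\ell(p)}\,M(xy) \\
\frac{\mu(y|x)}{M(y|x)} &\le 2c\;2^{K_*(\mu|x*)+K(d)} \\
\log_2\frac{\mu(y|x)}{M(y|x)} &\le K_*(\mu|x*)+K(d)+\log_2(2c)
\end{align*}
```

The third line uses $\ell(p)=K_*(\mu|x*)$, and the additive constant $\log_2(2c)$ is absorbed by $\stackrel{+}{\leq}$.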
Proof of Claim 10. First observe that for all $z\in S_{d,T}$

$$M(z) > 2^d\mu^T(z),$$

since

$$\sum_{v\in X^{\ell(z)}\setminus\{z\}}\mu^T(v) + 2^{-d}M(z) > 1 \quad\text{and}\quad \sum_{v\in X^{\ell(z)}}\mu^T(v)\le 1$$

by definition of $S_{d,T}$ and by the semimeasure property, respectively. To prove the claim we will group items with the same $\mu^T$, replace sums of $\mu^T$-measures of several $z$ by the $\mu^T$-measure of their common prefix from $S_{d,T}$, change $\mu^T$ to $M$ using the inequality above, and finally show (using "prefix-free" properties of $K_*$) that the coefficients of $M(z)$ in the sum are small. By definition,

$$\sum_{z\in A}\tilde\nu_d(z) = \sum_{z\in A}\sum_T\lambda_{d,T}(z)\cdot 2^d\mu^T(z) = \sum_T\sum_{z\in A}\lambda_{d,T}(z)\cdot 2^d\mu^T(z).$$

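Let us estimate the inner sum. Let $\pi_{d,T}(z)$ be the string $p$ that gives the maximum in the definition of $\lambda_{d,T}(z)$ (if there are several such $p$ we always take, say, the lexicographically first), that is, $\lambda_{d,T}(z)=2^{-\ell(p)}$ and there exists a prefix $z'$ of $z$ such that $z'\in S_{d,T}$ and $E(p,z')=T$. Let $\zeta_{d,T}(z)$ be the shortest of such $z'$. It is easy to see that $\zeta_{d,T}(\zeta_{d,T}(z))=\zeta_{d,T}(z)$ and $\lambda_{d,T}(\zeta_{d,T}(z))=\lambda_{d,T}(z)$.

$$\sum_{z\in A}\lambda_{d,T}(z)\cdot 2^d\mu^T(z) = \sum_v\ \sum_{z\in A:\ \zeta_{d,T}(z)=v}\lambda_{d,T}(z)\cdot 2^d\mu^T(z) = \sum_v\ \sum_{z\in A:\ \zeta_{d,T}(z)=v}\lambda_{d,T}(v)\cdot 2^d\mu^T(z)$$

$$\le \sum_{v:\ \exists z\in A:\ \zeta_{d,T}(z)=v}\lambda_{d,T}(v)\cdot 2^d\mu^T(v) \le \sum_{v:\ \zeta_{d,T}(v)=v}\lambda_{d,T}(v)\cdot 2^d\mu^T(v)$$

$$\le \sum_{v:\ \zeta_{d,T}(v)=v}\lambda_{d,T}(v)\,M(v).$$

In the first inequality we used that $\zeta_{d,T}(z)$ is a prefix of $z$, that the set $A$ is prefix free, and summed the $\mu^T(z)$ to $\mu^T(v)$. Now we can forget about $A$. If $\zeta_{d,T}(z)=v$ for some $z$, then $\zeta_{d,T}(v)=\zeta_{d,T}(\zeta_{d,T}(z))=v$, and we get the second inequality. The last inequality holds since $\zeta_{d,T}(v)$ belongs to $S_{d,T}$. Thus, we need to bound the sum

$$\sum_T\ \sum_{v:\ v=\zeta_{d,T}(v)}\lambda_{d,T}(v)\,M(v) = \sum_v\Big(\sum_{T:\ v=\zeta_{d,T}(v)}\lambda_{d,T}(v)\Big)M(v).$$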

We say that a function $f: X^*\to[0,1]$ is unit-summable along any sequence if for any $z\in X^*$

$$\sum_{i=0}^{\ell(z)} f(z_{1:i}) \le 1.$$

Claim 11. The function $f(v)=\sum_{T:\ v=\zeta_{d,T}(v)}\lambda_{d,T}(v)$ is unit-summable along any sequence.

Lemma 12. Let $\nu$ be a semimeasure. If a function $f$ is unit-summable along any sequence, then

$$\sum_{z\in X^*} f(z)\nu(z) \le 1.$$

Applying Lemma 12 with $\nu=M$ and with $f$ from Claim 11 shows that the sum above is at most 1. This concludes the proof of Claim 10. □

Proof of Lemma 12. Since $f(z)$ and $\nu(z)$ are non-negative, it is sufficient to prove $\sum_{\ell(z)\le n}f(z)\nu(z)\le 1$ for all $n$. Also we can assume that $\nu$ is a measure (the sum does not decrease if $\nu$ is increased to a measure). Then

$$\sum_{\ell(z)\le n} f(z)\nu(z) = \sum_{\ell(z)\le n} f(z)\sum_{\substack{\ell(v)=n,\\ z\text{ prefix of }v}}\nu(v) = \sum_{\ell(v)=n}\ \sum_{\substack{\ell(z)\le n,\\ z\text{ prefix of }v}} f(z)\nu(v)$$

$$= \sum_{\ell(v)=n}\ \sum_{i=0}^{n} f(v_{1:i})\,\nu(v) \le \sum_{\ell(v)=n}\nu(v) \le 1.$$

□
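Lemma 12 can be checked numerically on a small instance. The sketch below rests on our own choices: `f` is a hypothetical unit-summable function, `nu` is the uniform measure on binary strings, and the sum is truncated at length `n`, with prefixes taken to include the empty string. Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import product

def words(n, alphabet="01"):
    """All strings over the alphabet of length 0..n."""
    return ["".join(w) for k in range(n + 1) for w in product(alphabet, repeat=k)]

def f(z):
    """Toy function: the values along any sequence form the series 1/2 + 1/4 + ..., which stays below 1."""
    return Fraction(1, 2 ** (len(z) + 1))

def nu(z):
    """Uniform measure on binary strings."""
    return Fraction(1, 2 ** len(z))

def is_unit_summable(f, n):
    """Check that the sum of f over all prefixes of z is at most 1, for every z of length <= n."""
    return all(sum(f(z[:i]) for i in range(len(z) + 1)) <= 1 for z in words(n))

def weighted_sum(f, nu, n):
    """Sum of f(z) * nu(z) over all z of length <= n."""
    return sum(f(z) * nu(z) for z in words(n))
```

For this pair, `weighted_sum(f, nu, n)` stays below 1 for every `n`, as the lemma predicts.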



Proof of Claim 11. Take any $z\in X^*$. Let us show that

$$\sum_{\substack{v\text{ prefix of }z,\\ T:\ v=\zeta_{d,T}(v)}}\lambda_{d,T}(v) \le 1.$$

Recall that if $\lambda_{d,T}(v)\ne 0$, then $\lambda_{d,T}(v)=2^{-\ell(\pi_{d,T}(v))}$. We will show that the set $B(z)=\{\pi_{d,T}(v) \mid v=\zeta_{d,T}(v),\ v\text{ is a prefix of }z\}$ is prefix free, and if $\pi_{d,T_1}(v_1)=\pi_{d,T_2}(v_2)\in B(z)$, then $v_1=v_2$ and $T_1=T_2$. Consequently,

$$\sum_{\substack{v\text{ prefix of }z,\\ T:\ v=\zeta_{d,T}(v)}}\lambda_{d,T}(v) = \sum_{p\in B(z)}2^{-\ell(p)} \le 1.$$

Assume the converse, that there exist different $v_i$, $T_i$, $i=1,2$, such that $p_1=\pi_{d,T_1}(v_1)$ is a prefix (proper or not) of $p_2=\pi_{d,T_2}(v_2)$, $v_1$ and $v_2$ are prefixes of $z$, and $v_i=\zeta_{d,T_i}(v_i)$. By definition of $\zeta$, we have $v_i\in S_{d,T_i}$ and $T_i=E(p_i,v_i)$. Hence, by the second requirement of $K_*$-correctness, $T_1=E(p_1,v_1)=E(p_2,z)=E(p_2,v_2)=T_2$. Let $T=T_1=T_2$.

Let us show that $v_1=v_2$ too. Since they both are prefixes of $z$, one of them is a prefix of the other. Suppose $v_1$ is a prefix of $v_2$. By the second requirement of $K_*$-correctness, $E(p_2,v_1)=E(p_1,v_1)=T$. By definition, $\zeta_{d,T}(v_2)$ is the shortest prefix of $v_2$ belonging to $S_{d,T}$ and such that $E(p_2,\cdot)=T$, therefore $\zeta_{d,T}(v_2)$ is a prefix of $v_1$, and thus $v_1=v_2$. Suppose $v_2$ is a prefix of $v_1$. Since $E(p_1,v_1)=T$ and $E(p_2,v_2)=T$, we have $E(p_1,v_2)=T$ by the third requirement of $K_*$-correctness. As before, we get that $\zeta_{d,T}(v_1)$ is a prefix of $v_2$, and $v_1=v_2$. □

8 Discussion

Conclusion. We evaluated the quality of predicting a stochastic sequence at an intermediate time, when some beginning of the sequence has already been observed, estimating the future loss of the universal Solomonoff predictor $M$. We proved general upper bounds for the discrepancy between conditional values of the predictor $M$ and the true environment $\mu$, and demonstrated a kind of tightness for these bounds. One of the bounds is based on a new variant of conditional algorithmic complexity $K_*$, which has interesting properties of its own. In contrast to standard prefix complexity $K$, $K_*$ is a monotone function of conditions: $K_*(y|xz*)\le K_*(y|x*)$.

General Bayesian posterior bounds. A natural question is whether posterior bounds for general Bayes mixtures based on general $\mathcal{M}\ni\mu$ could also be derived. The mixture representation (3) can be written as a posterior representation

$$\xi(y|x) = \sum_{\nu\in\mathcal{M}} w_\nu(x)\,\nu(y|x) \ge w_\mu(x)\,\mu(y|x), \quad\text{where}\quad w_\nu(x) := w_\nu\frac{\nu(x)}{\xi(x)}$$

is the posterior belief in $\nu$ after observing $x$ (and $w_\nu$ is the prior). This immediately implies the bound $D_{l:\infty}\le\ln w_\mu(\omega_{<l})^{-1}$. Strangely enough, for $\mathcal{M}=\mathcal{M}_U$, $\log_2w_\nu^{-1}:=K(\nu)$ does not imply $\log_2w_\mu(x)^{-1}\stackrel{+}{=}K(\mu|x)$, not even within logarithmic accuracy, so it was essential to consider $D_{l:\infty}$. It would be interesting to derive bounds on $D_{l:\infty}$ or $\ln w_\mu(x)^{-1}$ for general $\mathcal{M}$ similar to the ones derived here for $\mathcal{M}=\mathcal{M}_U$.
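The posterior representation can be verified on a small finite model class. The following sketch is purely illustrative: the class of three Bernoulli models, the uniform prior, and all names (`bernoulli`, `models`, `prior`, `xi`, `posterior`, `cond`) are our assumptions and stand in for a general class $\mathcal{M}$ with prior weights $w_\nu$.

```python
from fractions import Fraction

def bernoulli(theta):
    """Probability of a binary string under an i.i.d. Bernoulli(theta) model."""
    def prob(x):
        ones = x.count("1")
        return theta ** ones * (1 - theta) ** (len(x) - ones)
    return prob

# Hypothetical finite model class with a uniform prior.
models = {
    "theta=1/4": bernoulli(Fraction(1, 4)),
    "theta=1/2": bernoulli(Fraction(1, 2)),
    "theta=3/4": bernoulli(Fraction(3, 4)),
}
prior = {name: Fraction(1, 3) for name in models}

def xi(x):
    """Bayes mixture: prior-weighted sum of the model probabilities."""
    return sum(prior[k] * models[k](x) for k in models)

def posterior(k, x):
    """Posterior weight of model k after observing x: prior times likelihood over xi(x)."""
    return prior[k] * models[k](x) / xi(x)

def cond(rho, y, x):
    """Conditional probability rho(y|x) = rho(xy) / rho(x)."""
    return rho(x + y) / rho(x)

x, y = "1101", "11"
mixture_prediction = cond(xi, y, x)
posterior_average = sum(posterior(k, x) * cond(models[k], y, x) for k in models)
```

With exact fractions the two quantities coincide, and dropping all but one term of the sum gives the lower bound $w_\mu(x)\mu(y|x)$ for whichever model plays the role of $\mu$.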

Online classification. All considered distributions $\rho(x)$ (in particular $\xi$, $M$, and $\mu$) may be replaced everywhere by distributions $\rho(x|z)$ additionally conditioned on some $z$. The $z$-conditions nowhere cause problems as they can essentially be thought of as fixed (or as oracles or spectators). An (i.i.d.) classification problem is a typical example: At time $t$ one arranges an experiment $z_t$ (or observes data $z_t$), then tries to make a prediction, and finally observes the true outcome $x_t$ with probability $\mu(x_t|z_t)$. In this case $\mathcal{M}=\{\nu(x_{1:n}|z_{1:n})=\nu(x_1|z_1)\cdot\ldots\cdot\nu(x_n|z_n)\}$. (Note that $\xi$ is not i.i.d.) Solomonoff's bound $K(\mu)\ln 2$ in (6) holds unchanged. Compared to the sequence prediction case we have extra information $z$, so we may wonder whether some improved bound, $K(\mu|z)$ or so, holds. For a fixed $z$ this can be achieved by also replacing $2^{-K(\nu)}$ in (3) by $2^{-K(\nu|z)}$. But if at time $t$ only $z_{1:t}$ is known, as in the classification example, this leads to difficulties ($\xi$ is no longer a (semi)measure, which sometimes can be corrected [PH04]). Alternatively we could keep definition (3) but apply it to the (chronologically correctly ordered) sequence $z_1x_1z_2x_2\ldots$, condition by (1) to $z_{1:t}$, and try to derive improved bounds.

More open problems. Since $D_{1:\infty}$ is finite, one may expect that the tails $D_{l:\infty}$ tend to 0 as $l\to\infty$. However, as Lemma 3 implies, this holds only with probability 1: for some special $\alpha$, $D_{l:\infty}(\alpha_{<l})$ is even bounded below by a constant fraction of $K(l)$ (see Lemma 3 for the exact statement) and hence tends to $\infty$. It would be very interesting to find a wide class of $\alpha$ such that $D_{l:\infty}(\alpha_{<l})\to 0$. The natural conjecture is that one should take $\mu$-random $\alpha$. Another (probably closely related) task is to study the asymptotic behavior of $K_*(\mu|\alpha_{<l}*)$. It is natural to expect that $K_*(\mu|\alpha_{<l}*)$ is bounded by an absolute constant (independent of $\mu$) for "most" $\alpha$ and for sufficiently large $l$. Finally, (dis)proving our conjectured equality of the various definitions of $K_*$ we gave would be interesting and useful.